Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limma mixed models feature #6753

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Limma mixed models feature #6753

wants to merge 7 commits into from

Conversation

KamilMaliszArdigen
Copy link
Contributor

@KamilMaliszArdigen KamilMaliszArdigen commented Oct 8, 2024

This extends limma functionality with option to process rna-seq data in pairwise and mixed mode. It will be used in differentialabundance pipeline and in future can be used in other pipelines which are using limma.

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@pinin4fjords pinin4fjords self-requested a review October 8, 2024 11:41
@@ -4,17 +4,19 @@ process LIMMA_DIFFERENTIAL {

conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/bioconductor-limma:3.54.0--r42hc0cfd56_0' :
'biocontainers/bioconductor-limma:3.54.0--r42hc0cfd56_0' }"
'oras://community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:7fc48564d286c1c6' :
Copy link
Member

@maxulysse maxulysse Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't have any oras protocol in the modules, it doesn't work with NXF_SINGULARITY_CACHEDIR
Can you swittch to https?

'https://depot.galaxyproject.org/singularity/bioconductor-limma:3.54.0--r42hc0cfd56_0' :
'biocontainers/bioconductor-limma:3.54.0--r42hc0cfd56_0' }"
'oras://community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:7fc48564d286c1c6' :
'community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }"
'nf-core/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:edea0f9fbaeba3a0' }"

I pulled and pushed to quay.io

So we can change the registry from all container in a simpler way.
We'll be using community.wave.seqera.io as registry when we do the switch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not keen on this @maxulysse, there's no way for someone without privilege to do this and maintain the module.

Copy link
Member

@pinin4fjords pinin4fjords left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few changes I'd like to see here.

I understand the attraction of readr etc, but until we have full Seqera Containers integration with nf-core tooling (which shouldn't be long), extra packages create challenges.

The forking of logic is also a bit ugly. If the logic is sufficiently different to require two scripts it should be two modules- though I'd favour integration if it didn't create excess complexity in the R code.

dependencies:
- bioconda::bioconductor-limma=3.54.0
- bioconda::bioconductor-edger=4.0.16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This multi-package thing still creates difficulties right now- which is why I wrote this module depending on a single Biocontainer.

@@ -4,17 +4,19 @@ process LIMMA_DIFFERENTIAL {

conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/bioconductor-limma:3.54.0--r42hc0cfd56_0' :
'biocontainers/bioconductor-limma:3.54.0--r42hc0cfd56_0' }"
'oras://community.wave.seqera.io/library/bioconductor-edger_bioconductor-ihw_bioconductor-limma_r-dplyr_r-readr:7fc48564d286c1c6' :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These oras links don't work well with nf-core tooling right now, unfortunately

@@ -23,5 +25,10 @@ process LIMMA_DIFFERENTIAL {
task.ext.when == null || task.ext.when

script:
template 'limma_de.R'
if (type == 'rnaseq') {
template 'limma_de_rnaseq.R'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not massively keen on the parallel script.

Either the new thing should be a separate module, or they should be properly integrated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How you envision "proper" integration? As single template? I will refactor that if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, if the two scripts share enough logic, they should be one with a simple conditional. If the logic is quite divergent such that doing that adds too much complexity, these should be separate modules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I had similar doubts in the one hand this is still limma and differential abundance analysis and on the other site this one is combined with voom and there is lot of differences. So maybe we will create new module named limma/differential-voom? or simply limma/voom - what are your thoughts on the module naming?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be limma/differential-voom.

As in the comments below, I also think this should be, as much as possible, a thin wrapper around Limma functions only. Otherwise it's just your custom script, rather than a 'standard' Limma module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updating this older comment after newer ones below: I don't believe additional logic required for Voom merits its own module. We can add the Voom part in a conditional, and other changes (e.g. duplicateCorrelation) etc, apply equally to non-Voom Limma.

- bioconda::bioconductor-limma=3.54.0
- bioconda::bioconductor-edger=4.0.16
- bioconda::bioconductor-ihw=1.28.0
- bioconda::bioconductor-limma=3.58.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limma is a dependency of edger, you don't need to include it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true however without this we won't be able to control limma version which will be used

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only pin the primary dependency in the modules repo, unless you have a really compelling reason to do otherwise. We will at some point have lock files that pin all dependencies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(as in other comments, I'm not sure we need edgeR here, so we'd only pin Limma)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we noticed that results are slightly different depending on limma versions so we decided to pin it to achieve reproducibility. I will adopt your comment in standalone module

Copy link
Member

@pinin4fjords pinin4fjords Oct 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I see your point on this one, apologies. Limma is the primary dependency so it's pinned, but we need edgeR.

dim(DGE)

# Calculate normalization factors
DGE <- calcNormFactors(DGE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct me if I'm wrong, but this and the DGEList are the only things that need edgeR here, right? voom() will do the normalisation things, and will just take a matrix. Then we don't need the edgeR dependency.

Copy link
Contributor Author

@KamilMaliszArdigen KamilMaliszArdigen Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use edgeR dependency to read raw matrix count to create DGEList object which is one of required ways to provide expression data to voom: "counts - a numeric matrix containing raw counts, or an ExpressionSet containing raw counts, or a DGEList object." : https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/voom

So we either need BioBase or edgeR dependency.

Copy link
Member

@pinin4fjords pinin4fjords Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, as I say, we don't need edgeR, and that will simplify the software dependencies

Actually, after reminding myself, I take this back, think it probably does need the DGEList etc the way you're doing.

## Generate outputs
# Differential expression table

if (opt\$IHW_correction) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO if we're writing 'limma' modules they should be thin-as-possible wrappers around Limma functions. We shouldn't we embedding lots of additional logic, aside from the boilerplate required to pass inputs to those functions correctly and write the outputs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extensive customisations should probably be in local modules in the workflow, and we'd want to discuss to determine if this was part of standard best practice.

Copy link
Member

@pinin4fjords pinin4fjords left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just had some time to review this again, and I think the R part also needs work. There's no need to treat the pairing variable in a special way as far as I can tell, it can be a blocking variable like any other. At that point, the main difference between the two modes is the use of duplicateCorrelation, for which there could just be a separate flag.

Fundamentally, this would integrate much better into the existing differentialabundance logic if started from the existing limma module.

I actually don't think the logic is very different with using Voom, so for basic Voom integration the existing module can just have a conditional in it something like this, about here:

if (!is.null(opt$use_voom) && opt$use_voom) {
    # Create a DGEList object for RNA-seq data
    dge <- DGEList(counts = intensities.table)
    
    # Normalize counts using TMM
    dge <- calcNormFactors(dge, method = "TMM")
    
    # Run voom to transform the data
    voom_result <- voom(dge, design)
    data_for_fit <- voom_result
} else {
    # Use as.matrix for regular microarray analysis
    data_for_fit <- as.matrix(intensities.table)
}

All the other logic can be the same, and then we can add in any features you think are missing from there.

# Calculate normalization factors
DGE <- calcNormFactors(DGE)

if (opt\$analysis_type == "pairwise") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (opt\$analysis_type == "pairwise") {
if (opt\$analysis_type == "pairwise") {

I don't think you even need this extra mode. Sample pairing can be handled as a blocking variable, no? Even if it was needed, you're duplicating a lot of code between the two modes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we are duplicating some code. Unfortunately the mixed model integration have large amount of small additions which resulted in multiple if's example can be voom which is called twice. I will try to incorporate voom into original script. I'm not yet fulli convinced about mixedmodel mode.

@pinin4fjords
Copy link
Member

Just to re-state the above, and after refreshing myself on the Voom logic, I actually don't think Voom needs to be its own module. I think this can be integrated into the existing module.

The voom-specific component is a matter of a few lines. All the other changes are not Voom specific.

@KamilMaliszArdigen
Copy link
Contributor Author

KamilMaliszArdigen commented Oct 14, 2024

Just to re-state the above, and after refreshing myself on the Voom logic, I actually don't think Voom needs to be its own module. I think this can be integrated into the existing module.

The voom-specific component is a matter of a few lines. All the other changes are not Voom specific.

I missed your earlier comment I will take a closer look on this one and figure out how to incorporate mixedmodel in the nice clean way.

At least thanks to new module we have nf-test almost ready :D

@KamilMaliszArdigen
Copy link
Contributor Author

@pinin4fjords I've extended original limma module with voom and IHW correction. This is still a draft as my results are now significantly different and sitll this have to be sorted out. I wanted to ask you if you fill that this is a right direction. Also any suggestion regarding incorporation of mixed models logic will be welcome :)

Thank you in advance.

Copy link
Member

@pinin4fjords pinin4fjords left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the ongoing work- this is definitely going the right way.

But I still think the IHW code doesn't belong in a module wrapping standard Limma functions.

@@ -322,18 +357,54 @@ ebayes_args = c(

fit2 <- do.call(eBayes, ebayes_args)

# Run topTable() to generate a results data frame
if ((!is.null(opt\$use_voom) && opt\$use_voom) && (!is.null(opt\$IHW_correction) && opt\$IHW_correction)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but I don't think the IHW correction belongs in this module.

nf-core modules should be thin wrappers around underlying software. We have some unavoidable boilerplate when we're wrapping library code like this, but we shouldn't be adding to that with our own judgement calls- it makes the code harder to maintain, and harder for users expecting Limma-native functionality to understand.

data_for_fit <- voom_result

# Write the normalized counts matrix to a TSV file
normalized_counts <- voom_result\$E
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're adding export of normalised values (which I think is a good thing) we should do the same for the microarry (e.g. with normalizeBetweenArrays), and do the writing outside this conditional.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants